Abstract: A fast Fourier transform (FFT) of a large number of
samples requires a large share of the hardware resources of a field
programmable gate array (FPGA), which increases area and power. In this
paper, we present an area-efficient architecture of an FFT processor
that reuses the butterfly elements several times. The FFT processor
is simulated using VHDL and the results are validated
on a Virtex-6 FPGA. The proposed architecture outperforms the
conventional architecture of an $N$-point FFT processor in terms
of area, which is reduced by a factor of $\log_N 2$, with a negligible
increase in processing time.
I. Introduction

Field programmable gate arrays (FPGAs) are programmed
specifically for the problem to be solved; hence they can
achieve higher performance with lower power consumption
than general-purpose processors. Therefore, the FPGA is a promising
implementation technology for computationally intensive
applications such as signal, image, and network processing
tasks [1].

The fast Fourier transform (FFT) is one of the most widely
used operations in digital signal processing algorithms [2]
and plays a significant role in numerous signal processing
applications, such as image processing, speech processing, and
software defined radio. FFT processors should offer high
throughput with low computation time. For a larger number of
data samples, the area of the FFT processor becomes a concern,
since the number of stages of FFT computation grows as
$\log_2 N$. In the design
of high-throughput FFT architectures, energy-efficient design
techniques can be used to maximize performance under power
dissipation constraints.

The spatial and parallel FFT architecture, also known as the array
architecture [3], based on the Cooley-Tukey algorithm layout,
is one of the potential high-throughput designs. However, the
implementation of the array architecture is hardware intensive.
It achieves high performance through spatial parallelism,
at the cost of more routing resources. As the problem
size grows, unfolding the architecture spatially becomes
infeasible because of the serious power and area issues caused by
complex interconnections.

The pipelined architectures are useful for FFTs that require 
high data throughput 13, Q, lO, El- The basic principle 
of pipelined architectures is to collapse the rows. Radix-2
multi-path delay commutator H) HI was probably the most
classical approach for pipeline implementation of radix-2 FFT 
algorithm. Disadvantages include an increase in area due to 
the addition of memories and delay which is related to the 
memory usage cni. 

In this paper, we propose a novel area-efficient FFT
architecture that reuses $N/2$ butterfly units more than
once instead of using $(N/2)\log_2 N$ butterfly units once ifTTl .
This is achieved by a time control unit which feeds the
previously computed data of the $N/2$ butterfly units back to them
$(\log_2 N) - 1$ times, reusing the butterfly units to complete
the FFT computation. The area requirement, only $N/2$ radix-2
elements, is smaller than that of the array and
pipelined architectures, where $N$ is the number of sample points.


II. Traditional FFT Algorithm 

The Cooley-Tukey FFT algorithm is the most common
algorithm for developing an FFT. It computes an FFT of size $N$
recursively: the larger FFT is divided into smaller FFTs, which
reduces the complexity of the algorithm. If the size of the FFT
is $N$, the algorithm factors $N = N_1 N_2$, where $N_1$
and $N_2$ are the sizes of the smaller FFTs. Radix-2
decimation-in-time (DIT) is the most common form of the Cooley-Tukey
algorithm. It applies when $N$ can be expressed as a
power of 2, that is, $N = 2^M$, where $M$ is an integer. The
algorithm is called decimation-in-time because at each stage
the input sequence is divided into smaller sequences, i.e.
the input sequences are decimated at each stage. The FFT of an
$N$-point discrete-time complex sequence $x(n)$, indexed by
$n = 0, 1, \ldots, N-1$, is defined as:

$$Y(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \qquad (1)$$

where $W_N = e^{-j 2\pi / N}$. Radix-2 divides the FFT into two
equal parts. The first part calculates the Fourier transform of
the even-indexed samples. The other part calculates the Fourier
transform of the odd-indexed samples. The two are then merged
to obtain the Fourier transform of the whole sequence.

Separating $x(n)$ into its even- and odd-indexed values,
$x_e(n) = x(2n)$ and $x_o(n) = x(2n+1)$, we obtain

$$Y(k) = \sum_{n=0}^{N/2-1} x_e(n)\, W_{N/2}^{nk} + W_N^{k} \sum_{n=0}^{N/2-1} x_o(n)\, W_{N/2}^{nk} \qquad (2)$$
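
The even/odd split of Eq. (2) can be checked numerically. The following Python sketch is illustrative only and is not the VHDL implementation of this paper; the function names `dft` and `fft_dit` are hypothetical. It evaluates Eq. (1) directly and compares it with a recursive radix-2 DIT evaluation of Eq. (2), using the periodicity of $W_{N/2}^{nk}$ and the identity $W_N^{k+N/2} = -W_N^{k}$ for the upper half of the outputs.

```python
import cmath


def dft(x):
    """Direct evaluation of Eq. (1): Y(k) = sum_n x(n) * W_N^(n*k)."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def fft_dit(x):
    """Recursive radix-2 DIT FFT following Eq. (2); len(x) must be a power of 2."""
    n_pts = len(x)
    if n_pts == 1:
        return list(x)
    even = fft_dit(x[0::2])  # transform of the even-indexed samples x_e(n)
    odd = fft_dit(x[1::2])   # transform of the odd-indexed samples x_o(n)
    half = n_pts // 2
    out = [0j] * n_pts
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n_pts) * odd[k]  # W_N^k * odd part
        out[k] = even[k] + t
        out[k + half] = even[k] - t  # uses W_N^(k + N/2) = -W_N^k
    return out
```

Both functions should agree to within floating-point rounding for any power-of-2 input length.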

III. Proposed FFT Algorithm 

The area of an FFT processor depends on the total number
of butterfly units used. Each butterfly unit consists of
multiplier and adder/subtractor blocks. The higher the bit resolution
of the samples, the larger the area of these two arithmetic blocks.
In the traditional FFT algorithm, each of the $\log_2 N$ stages contains
$N/2$ butterfly units. Therefore, for a traditional
FFT processor, the total number of butterfly units is given by

$$BU_{\mathrm{Traditional\,FFT}} = (N/2) \log_2 N \qquad (3)$$

In the proposed algorithm, $N/2$ butterfly units are
reused $\log_2 N$ times. Therefore, the modified architecture
of the FFT processor requires $BU_{\mathrm{Proposed\,FFT}}$ butterfly
units, given by

$$BU_{\mathrm{Proposed\,FFT}} = N/2 \qquad (4)$$

The proposed architecture of the FFT processor reduces the
number of butterfly units by a factor of $\alpha$, which is given by

$$\alpha = \frac{N/2}{(N/2) \log_2 N} = (\log_2 N)^{-1} = \log_N 2 \qquad (5)$$

Table I shows that the proposed FFT requires fewer multipliers and
adders/subtractors than the traditional FFT.

TABLE I. Comparison of butterfly units, multipliers and
adders/subtractors

| | Traditional FFT | Proposed FFT |
|---|---|---|
| Butterfly unit (BU) | $(N/2) \log_2 N$ | $N/2$ |
| Multiplier | $(N/2) \log_2 N$ | $N/2$ |
| Adder/subtractor | $N \log_2 N$ | $N$ |

TABLE II. Number of butterfly units

| Number of samples | Traditional architecture | Proposed architecture |
|---|---|---|
| 8 | 12 | 4 |
| 16 | 32 | 8 |
| 32 | 80 | 16 |
| 64 | 192 | 32 |
| 128 | 448 | 64 |
| 256 | 1024 | 128 |
| 512 | 2304 | 256 |
| 1024 | 5120 | 512 |
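
The entries of Table II follow directly from Eqs. (3) and (4). As a quick cross-check, the Python sketch below recomputes them; the function name `butterfly_counts` is a hypothetical helper introduced here for illustration and is not part of the design.

```python
def butterfly_counts(n_points):
    """Return (traditional, proposed) butterfly-unit counts for an N-point FFT.

    Traditional: (N/2) * log2(N), Eq. (3). Proposed: N/2, Eq. (4).
    n_points is assumed to be a power of 2.
    """
    stages = n_points.bit_length() - 1  # exact log2 for powers of 2
    traditional = (n_points // 2) * stages
    proposed = n_points // 2
    return traditional, proposed


# Recompute the rows of Table II.
for n in (8, 16, 32, 64, 128, 256, 512, 1024):
    trad, prop = butterfly_counts(n)
    print(f"N={n:5d}  traditional={trad:5d}  proposed={prop:4d}  ratio={trad // prop}")
```

The ratio column equals $\log_2 N$, which is the reciprocal of the factor $\alpha$ in Eq. (5).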


IV. Architecture of Proposed FFT Processor 
The key feature of the proposed FFT processor is its low
area. The proposed architecture reuses $N/2$ butterfly
units $\log_2 N$ times. A block diagram of the overall architecture
of the proposed FFT processor is shown in Fig. 3. It consists of a
routing network, butterfly units, a control unit and input/output
enable blocks.



Fig. 1. Comparison of the number of multipliers required in the traditional and
proposed FFT processors



Fig. 2. Comparison of the number of adder/subtractor blocks required in the
traditional and proposed FFT processors


A. The control unit 

The control unit synchronizes all the blocks
of the FFT processor. It counts the stages and
controls the input, output and feedback data buses. The control
block generates three signals: a 1-bit input select line (ISL), an
output select line (OSL) and a multi-bit stage bus (SB). The
stage bus carries the current stage of the FFT computation. The
control unit increments the stage number on the rising
edge of the clock signal. ISL = '0' at the initial stage to select
data from the external source; after that, ISL = '1' for the
remaining $(\log_2 N) - 1$ stages to select data
from the feedback path (register array). OSL = '0' for the first
$(\log_2 N) - 1$ stages to route the output
data of the butterfly units to the register array. At the
$\log_2 N$-th stage, OSL = '1' to enable the output data path of
the FFT processor.
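
The signal schedule described above can be sketched behaviourally. The Python generator below is an illustrative model, not the VHDL control unit; the name `control_signals` is hypothetical, and numbering the stage bus from 0 is an assumption made here.

```python
def control_signals(n_points):
    """Yield (SB, ISL, OSL) for each clock cycle of an N-point transform.

    Assumptions: SB counts stages from 0; one stage per rising clock edge;
    n_points is a power of 2.
    """
    stages = n_points.bit_length() - 1  # log2(N) stages in total
    for sb in range(stages):
        isl = 0 if sb == 0 else 1           # external input first, feedback afterwards
        osl = 1 if sb == stages - 1 else 0  # output path enabled only at the last stage
        yield sb, isl, osl


# For an 8-point FFT the schedule has three cycles.
for sb, isl, osl in control_signals(8):
    print(f"SB={sb}  ISL={isl}  OSL={osl}")
```

For $N = 8$ this gives ISL = 0, 1, 1 and OSL = 0, 0, 1 over the three stages, matching the $(\log_2 N) - 1$ feedback passes described in the text.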

Fig. 3. Architecture of the proposed FFT processor

Fig. 5. Timing diagram of butterfly unit


D. Routing network and register array 

The routing network passes the proper sequence of input
samples to the butterfly units for the different stages of
computation. The value of the stage bus (SB) controls the output of this unit.
At the first stage, the routing network generates the bit-reversed
sequence of the input samples. For the remaining stages, the routing
network shuffles the feedback samples with a distance of $2^{m-1}$,
where $m = 2, 3, \ldots, \log_2 N$. Fig. 6 shows the data path layout
of the 8-point FFT. Dashed arrows denote the feedback samples.
Crossed arrows signify the butterfly units. Register array CD
holds the previous data of the butterfly units and passes the
stored data on the rising edge of the clock.
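
A software model may help make the routing concrete. The Python sketch below is one possible reading of the description, not the authors' VHDL: it bit-reverses the input for the first stage, pairs samples at distance $2^{m-1}$ in stage $m$, and passes the same $N/2$ butterfly computations over the data $\log_2 N$ times, with a list standing in for the register array. The names `bit_reverse_order`, `butterfly_pairs` and `fft_reuse`, and the in-place indexing, are assumptions made for illustration.

```python
import cmath


def bit_reverse_order(x):
    """Reorder samples so that index i receives the sample at bit-reversed i."""
    n = len(x)
    bits = n.bit_length() - 1
    return [x[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]


def butterfly_pairs(n_points, m):
    """Index pairs (top, bottom) for stage m (1-based); distance is 2**(m-1)."""
    dist = 1 << (m - 1)
    return [(i, i + dist) for i in range(n_points) if not i & dist]


def fft_reuse(x):
    """Iterative radix-2 DIT FFT that reuses N/2 butterflies for log2(N) passes."""
    n = len(x)
    stages = n.bit_length() - 1
    data = bit_reverse_order(x)          # routing for the first stage
    for m in range(1, stages + 1):
        dist = 1 << (m - 1)
        nxt = [0j] * n
        for top, bot in butterfly_pairs(n, m):   # always N/2 pairs per pass
            w = cmath.exp(-2j * cmath.pi * (top % dist) / (2 * dist))
            t = w * data[bot]
            nxt[top] = data[top] + t
            nxt[bot] = data[top] - t
        data = nxt                        # register array: results fed back
    return data
```

For $N = 8$, `butterfly_pairs(8, 2)` pairs indices at distance 2 and `butterfly_pairs(8, 3)` at distance 4, which corresponds to $2^{m-1}$ for $m = 2, 3$.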


B. Butterfly unit (BU) 

As the mathematical diagram in Fig. 4 shows, the output samples of
the butterfly unit are generated by adding and subtracting
the product of the even data sample and the twiddle
factor. These butterfly units are clocked. The multiplication
starts on the rising edge of the clock, and the addition or
subtraction is performed on the falling edge of the clock signal,
so that the total operation of the butterfly unit is completed
within a single clock cycle, as shown in Fig. 5.

Fig. 4. Architecture of butterfly unit
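
The two-phase operation of the butterfly unit can be written as a small behavioural sketch. The Python function below is illustrative only and is not the VHDL entity; the name `butterfly` is hypothetical, and the choice of which input is multiplied by the twiddle factor is an assumption that follows the form of Eq. (2).

```python
def butterfly(a, b, w):
    """Radix-2 butterfly: returns (a + w*b, a - w*b).

    Phase 1 (rising clock edge in the text): the complex multiplication.
    Phase 2 (falling clock edge in the text): the addition and subtraction.
    """
    product = w * b          # multiplier block
    upper = a + product      # adder block
    lower = a - product      # subtractor block
    return upper, lower


# One multiplier and one adder/subtractor pair serve both outputs.
print(butterfly(1 + 0j, 1 + 0j, -1j))
```

Because the product is computed once and shared by both outputs, each butterfly unit needs a single complex multiplier, consistent with the multiplier counts in Table I.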
Fig. 6. Data path layout of 8-point FFT processor


C. Twiddle factor ROM 

The twiddle factor ROM stores the twiddle factor
coefficients. The size of this ROM is $\log_2 N \times (N/2)$. The block
has $N/2$ output signals, which are connected to the
$N/2$ butterfly units. The stage bus (SB) is connected
to the address bus of the ROM.
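
The ROM organisation can be illustrated with a table of $\log_2 N$ rows (addressed by the stage bus) and $N/2$ columns (one per butterfly unit). The Python sketch below is an assumption-laden illustration, not the stored contents of the actual design: the name `twiddle_rom`, the 0-based stage address, the floating-point representation and the assignment of exponents to butterfly units are all hypothetical and depend on the routing order.

```python
import cmath


def twiddle_rom(n_points):
    """Build a log2(N) x (N/2) table of twiddle factors.

    Row s (0-based stage address) holds the factor for each of the N/2
    butterfly units, assuming butterflies are numbered by increasing upper
    input index and the pairing distance in stage s is 2**s.
    """
    stages = n_points.bit_length() - 1
    rom = []
    for s in range(stages):
        dist = 1 << s
        row = [
            cmath.exp(-2j * cmath.pi * (j % dist) / (2 * dist))
            for j in range(n_points // 2)
        ]
        rom.append(row)
    return rom


rom = twiddle_rom(8)
print(len(rom), "rows x", len(rom[0]), "columns")
```

In this layout the first row contains only $W^0 = 1$, so a hardware realisation could in principle skip the multiplication in the first stage; whether the proposed design does so is not stated in the text.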


V. Implementation and Results 

Fig. 7 shows the architecture of an 8-point FFT processor
built according to the proposed design. Fig. 8 shows the
timing diagram of this processor. $X(n)$ and $Y(k)$ denote the
input and output samples, and $f(k)$ denotes the output samples of
the previous stage.

The proposed architecture for the 8-point FFT processor is
coded in VHDL, then emulated and synthesized using Xilinx
ISE 14.2 for a Virtex-6 FPGA. Table III compares the
advanced HDL synthesis report with that of the traditional FFT.
Fig. 10 shows the generated detailed RTL diagram of the proposed
8-point FFT processor. Fig. 9 compares the numbers of DSP slices
and LUTs required, and Table IV
compares the timing delay with that of the traditional
FFT processor.

Fig. 7. Proposed architecture of 8-point FFT processor

[Timing diagram signals: Clock; SB; Data In: X(n), f(k-1), f(k); Data Out: f(k-1), f(k), Y(k); ISL; OSL]

Fig. 8. Timing diagram of 8-point FFT



TABLE III. Comparison of advanced HDL synthesis reports

| Hardware | Traditional FFT | Proposed FFT |
|---|---|---|
| MACs | 24 | 8 |
| Multipliers | 24 | 8 |
| Adder/Subtractors | 72 | 25 |
| Multiplexers | 360 | 136 |
| XORs | 24 | 8 |
| Registers | - | 288 |
| Counter | - | 1 |


TABLE IV. Comparison of delay between traditional and
proposed FFT processor

| Algorithm | Delay (ns) |
|---|---|
| Traditional FFT | 29.111 |
| Proposed FFT | 29.397 |

TABLE V. Device utilization and timing summary

| Device Utilization Summary | |
|---|---|
| Selected Device | 6vsx475tff1759-2 |
| Number of Slice Registers | 301 out of 595200 |
| Number of Slice LUTs | 748 out of 297600 |
| Number of DSP48E1s | 16 out of 2016 |
| Timing Summary | |
| Minimum period | 19.598 ns |
| Maximum frequency | 51.025 MHz |
| Minimum input arrival time before clock | 9.384 ns |
| Maximum output required time after clock | 0.665 ns |
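
The timing figures in Tables IV and V can be related by simple arithmetic. The Python sketch below uses only the values reported in those tables; the helper names `max_frequency_mhz` and `relative_increase_percent` are hypothetical and introduced here for illustration.

```python
def max_frequency_mhz(min_period_ns):
    """Maximum clock frequency in MHz for a given minimum period in ns."""
    return 1000.0 / min_period_ns


def relative_increase_percent(baseline, value):
    """Percentage increase of value over baseline."""
    return 100.0 * (value - baseline) / baseline


# Values taken from Table V (minimum period) and Table IV (delays).
fmax = max_frequency_mhz(19.598)
delay_overhead = relative_increase_percent(29.111, 29.397)
print(f"fmax = {fmax:.3f} MHz, delay increase = {delay_overhead:.2f} %")
```

The computed frequency agrees with the reported 51.025 MHz to within rounding, and the delay of the proposed processor exceeds that of the traditional one by slightly less than 1 %.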


VI. Conclusion 

The proposed architecture yields an area-efficient radix-2
FFT processor. The algorithm reuses the butterfly units of
a single stage more than once, which reduces the number of
butterfly units from $(N/2)\log_2 N$ to $N/2$.
The architecture has been emulated and its performance
analysed in terms of overall response
time and utilization of FPGA hardware resources. The
analysis shows that the proposed architecture reduces the area
substantially, while the delay of the 8-point design increases only
from 29.111 ns to 29.397 ns. Further
improvements may be obtained by designing the silicon layout and
analysing the post-layout performance trade-off.


Fig. 9. Comparison of the number of LUTs and DSP slices between the traditional and
proposed FFT processors